In late 2016, Microsoft's DMTK (Distributed Machine Learning Toolkit) team open-sourced LightGBM on GitHub, a boosting tool whose performance surpassed its competitors. Within three days it had been starred over 1,000 times and forked over 200 times, a measure of how quickly it caught on.
As discussed earlier, GBDT (Gradient Boosting Decision Tree) is an enduring model in machine learning. Its core idea is to iteratively train weak learners (decision trees) to obtain an optimal ensemble; the resulting model trains well and is relatively resistant to overfitting. GBDT is widely used in industry, typically for tasks such as click-through-rate prediction and search ranking, and it is a staple weapon in data-mining competitions: by some counts, more than half of the winning solutions on Kaggle have been based on GBDT. XGBoost has long been the definitive GBDT implementation, but the arrival of LightGBM has challenged its standing.
LightGBM (Light Gradient Boosting Machine) (official GitHub, official English documentation, official Chinese documentation) is a lightweight framework implementing the GBDT algorithm. It supports efficient parallel training and, per its official documentation, offers the following advantages:
- faster training speed and higher efficiency
- lower memory usage
- better accuracy
- support for parallel, distributed, and GPU learning
- the ability to handle large-scale data
Training a model (classification)
import lightgbm as gbm
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
# read in the iris data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# train the model ('multiclass' is LightGBM's multi-class objective;
# 'multi:softmax' is XGBoost syntax and would be rejected here)
model = gbm.LGBMClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, objective='multiclass')
model.fit(X_train, y_train)
# predict on the test set
ans = model.predict(X_test)
model.score(X_test,y_test)
Plotting feature importance
# plot_importance creates its own figure, so no plt.figure() call is needed
gbm.plot_importance(model)
plt.show()
Plotting a tree from the model
# plot_tree requires the graphviz package and creates its own figure
gbm.plot_tree(model)
plt.show()
Training a model (regression)
import lightgbm as gbm
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
# read in the iris data
iris = load_iris()
X = iris.data[:,:3]
y = iris.data[:,3]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# train the model
model = gbm.LGBMRegressor(max_depth=5, learning_rate=0.1, n_estimators=160)
model.fit(X_train, y_train)
# predict on the test set
ans = model.predict(X_test)
model.score(X_test,y_test)
Plotting feature importance
# plot_importance creates its own figure, so no plt.figure() call is needed
gbm.plot_importance(model)
plt.show()
Plotting a tree from the model
# plot_tree requires the graphviz package and creates its own figure
gbm.plot_tree(model)
plt.show()
Complete runs with output (wrapping the data in a pandas DataFrame so the plots show feature names)
import lightgbm as gbm
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import pandas as pd
# read in the iris data
iris = load_iris()
X = pd.DataFrame(iris.data,columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# train the model ('multiclass' is LightGBM's multi-class objective;
# 'multi:softmax' is XGBoost syntax and would be rejected here)
model = gbm.LGBMClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, objective='multiclass')
model.fit(X_train, y_train)
# predict on the test set
ans = model.predict(X_test)
model.score(X_test,y_test)
0.9666666666666667
# plot_importance creates its own figure; calling plt.figure() first
# would produce a stray empty "<Figure ... with 0 Axes>" output
gbm.plot_importance(model)
plt.show()
# plot_tree requires the graphviz package and creates its own figure
gbm.plot_tree(model)
plt.show()
import lightgbm as gbm
import pandas as pd
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
# read in the iris data
iris = load_iris()
X = pd.DataFrame(iris.data[:,:3],columns=iris.feature_names[:3])
y = iris.data[:,3]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# train the model
model = gbm.LGBMRegressor(max_depth=5, learning_rate=0.1, n_estimators=160)
model.fit(X_train, y_train)
# predict on the test set
ans = model.predict(X_test)
model.score(X_test,y_test)
0.9124320518930263
# plot_importance creates its own figure, so no plt.figure() call is needed
gbm.plot_importance(model)
plt.show()
# plot_tree requires the graphviz package and creates its own figure
gbm.plot_tree(model)
plt.show()